HDDS-7738. SCM terminates when adding container to a closed pipeline #4154
@szetszwo @errose28 @aswinshakil please have a look. Some CI integration steps are failing, but they pass when I run them locally on my laptop and in the branch CI, so I guess they're just flaky. It would be nice if one of you could help retry them individually.
|
Does the SCM terminate on the active SCM, or is this on the follower SCMs? If we allow an open container on a closed pipeline, what will close the open container? The normal close flow is triggered when either the container fills up and the DN triggers a close, or the pipeline is closed and it triggers a close to all containers on the pipeline. I am also wondering, what happens to a container which is allocated on SCM, but never gets anything written to it. It will never get replicas on a DN, and hence will never have any replicas reported. Will it get cleaned up or will it hang around forever?
|
Thanks for having a look, @sodonnel.
The same transactions get replayed on all SCMs and result in the same errors, which prevents SCM from starting up.
I think such containers will be closed by the pipeline scrubber, which periodically scans and closes containers associated with closed pipelines.
As for containers that are allocated but never written to, I'm not sure. I can't find any process that cleans up empty containers, and it looks like a container can only be removed via the admin CLI. Alternatively, SCM could just reject the transaction (by throwing a non-terminating exception) and move on, but I'm not confident about the consequences.
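To make the scrubber idea concrete, here is a minimal sketch of one scrubber pass, assuming it simply moves every OPEN container that sits on a closed pipeline to CLOSING. The class `ClosedPipelineScrubberSketch`, its methods, and the two-state enum are hypothetical and are not the Ozone SCM classes; the real scrubber's behavior is not confirmed in this thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical model for illustration only; not the Ozone SCM implementation.
final class ClosedPipelineScrubberSketch {
  enum ContainerState { OPEN, CLOSING }

  private final Set<String> closedPipelines = new HashSet<>();
  // TreeMap keeps iteration order deterministic (ascending container ID).
  private final Map<Long, String> containerToPipeline = new TreeMap<>();
  private final Map<Long, ContainerState> containerStates = new HashMap<>();

  void addOpenContainer(long containerId, String pipelineId) {
    containerToPipeline.put(containerId, pipelineId);
    containerStates.put(containerId, ContainerState.OPEN);
  }

  void closePipeline(String pipelineId) {
    closedPipelines.add(pipelineId);
  }

  ContainerState stateOf(long containerId) {
    return containerStates.get(containerId);
  }

  /**
   * One pass: every OPEN container on a closed pipeline moves to CLOSING.
   * Returns the IDs of the containers that were moved.
   */
  List<Long> scrubOnce() {
    List<Long> moved = new ArrayList<>();
    for (Map.Entry<Long, String> entry : containerToPipeline.entrySet()) {
      long containerId = entry.getKey();
      boolean pipelineClosed = closedPipelines.contains(entry.getValue());
      if (pipelineClosed
          && containerStates.get(containerId) == ContainerState.OPEN) {
        containerStates.put(containerId, ContainerState.CLOSING);
        moved.add(containerId);
      }
    }
    return moved;
  }
}
```

Under this assumption, a container added to an already closed pipeline would be picked up on the next pass. The sketch says nothing about containers that never receive replicas, which remains the open question above.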
szetszwo
left a comment
@duongkame , thanks a lot for working on this!
- We should print a WARN message when it happens; see below.
- Also, let's keep the `addContainerToPipelineSCMStart` method for now so that it is easier to backport this change. We may do the code refactoring later.
--- a/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/pipeline/PipelineStateMap.java
+++ b/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/pipeline/PipelineStateMap.java
@@ -106,9 +106,10 @@ void addContainerToPipeline(PipelineID pipelineID, ContainerID containerID)
     Pipeline pipeline = getPipeline(pipelineID);
     if (pipeline.isClosed()) {
-      throw new IOException(String
-          .format("Cannot add container to pipeline=%s in closed state",
-              pipelineID));
+      LOG.warn("Adding container {} to pipeline={} in CLOSED state."
+          + " This happens only for some exceptional cases."
+          + " Check for the previous exceptions.",
+          containerID, pipelineID);
     }
     pipeline2container.get(pipelineID).add(containerID);
   }
Thanks for the suggestions, @szetszwo. I've made the updates.
szetszwo
left a comment
+1 the change looks good.
What changes were proposed in this pull request?
SCM should allow adding a container to a CLOSED pipeline, because the pipeline's state can change while the container-creation transaction is waiting to be processed by SCM.
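The ordering problem can be sketched as a replayed transaction log in which the pipeline close is applied before the pending add-container transaction. This is a simplified model under stated assumptions: `ReplaySketch`, `Txn`, and the `strict` flag are hypothetical names for illustration and do not correspond to the SCM state machine API.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of replaying SCM transactions; not the Ozone API.
final class ReplaySketch {
  static final class Txn {
    final boolean closesPipeline; // true: close pipeline, false: add container
    final String pipelineId;
    final long containerId;       // unused for close transactions

    private Txn(boolean closesPipeline, String pipelineId, long containerId) {
      this.closesPipeline = closesPipeline;
      this.pipelineId = pipelineId;
      this.containerId = containerId;
    }

    static Txn close(String pipelineId) {
      return new Txn(true, pipelineId, -1L);
    }

    static Txn add(String pipelineId, long containerId) {
      return new Txn(false, pipelineId, containerId);
    }
  }

  /**
   * Applies the log in order and returns the IDs of the added containers.
   * With strict=true, adding to a closed pipeline throws (old behavior in
   * this model); with strict=false, the add is accepted (proposed behavior).
   */
  static Set<Long> replay(List<Txn> log, boolean strict) {
    Set<String> closedPipelines = new HashSet<>();
    Set<Long> containers = new TreeSet<>();
    for (Txn txn : log) {
      if (txn.closesPipeline) {
        closedPipelines.add(txn.pipelineId);
        continue;
      }
      if (strict && closedPipelines.contains(txn.pipelineId)) {
        throw new IllegalStateException("Cannot add container "
            + txn.containerId + " to closed pipeline " + txn.pipelineId);
      }
      containers.add(txn.containerId);
    }
    return containers;
  }
}
```

In this model, replaying `close("p1")` followed by `add("p1", 7)` fails in strict mode every time the log is applied, which mirrors why each SCM hit the same error on startup, while the lenient mode keeps container 7 and moves on.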
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-7738
How was this patch tested?
Unit tests.
Standard CI: https://github.com/duongkame/ozone/actions/runs/3857497185/jobs/6574962174